Strava and GoldenCheetah
import pandas as pd
import numpy as np
import seaborn as sn
import calendar
from matplotlib import pyplot as plt
from datetime import timedelta
plt.style.use('seaborn')
%matplotlib inline
Importing all the necessary libraries and python packages to be used in the analysis.
The first dataset is an export of my ride data from Strava, an online social
network site for cycling and other sports. This data is a log of every ride since the start of 2018
and contains summary data like the distance and average speed. It was exported using
the script stravaget.py which uses the stravalib module to read data. Some details of
the fields exported by that script can be seen in the documentation for stravalib.
The exported data is a CSV file, so it is easy to read; however, the date information in the file is recorded in a different timezone (UTC), so we need to do a bit of conversion. In reading the data I'm setting the index of the data frame to be the datetime of the ride.
strava = pd.read_csv('data/strava_export.csv', index_col='date', parse_dates=True)
strava.index = strava.index.tz_convert('UTC')
print("Shape of the Strava Dataframe is:" , strava.shape)
strava.head()
The second dataset comes from an application called GoldenCheetah which provides some analytics services over ride data. This has some of the same fields but adds a lot of analysis of the power, speed and heart rate data in each ride. This data overlaps with the Strava data but doesn't include all of the same rides.
Again we create an index using the datetime for each ride, this time combining two columns in the data (date and time) and localising to Sydney so that the times match those for the Strava data.
cheetah = pd.read_csv('data/cheetah.csv', skipinitialspace=True)
cheetah.index = pd.to_datetime(cheetah['date'] + ' ' + cheetah['time'])
cheetah.index = cheetah.index.tz_localize('Australia/Sydney')
print("Shape of the cheetah dataframe is :", cheetah.shape)
cheetah.head()
The GoldenCheetah data contains many many variables (columns) and I won't go into all of them here. Some that are of particular interest for the analysis below are:
Here are definitions of some of the more important fields in the data. Capitalised fields come from the GoldenCheetah data while lowercase_fields come from Strava. There are many cases where fields are duplicated and in this case the values should be the same, although there is room for variation as the algorithm used to calculate them could be different in each case.
Some of the GoldenCheetah parameters are defined in their documentation.
Strava and GoldenCheetah files
str_cth_join=strava.join(cheetah,how='inner')
print("Shape of the dataframe after joining strava & cheetah file is:",str_cth_join.shape)
str_cth_join.head()
We have done an inner join between the two given datasets (strava and cheetah) and stored the result in the dataframe str_cth_join. As a result of the inner join we have 243 rows and 372 columns.
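Since many fields are duplicated across the two sources, a quick sanity check after the join is to compare a duplicated pair and confirm the values roughly agree. A minimal sketch with toy numbers standing in for str_cth_join:

```python
import pandas as pd

# Hypothetical mini-frame standing in for str_cth_join: 'distance' comes from
# Strava and 'Distance' from GoldenCheetah, and the two should roughly agree.
df = pd.DataFrame({'distance': [25.1, 40.3],
                   'Distance': [25.0, 40.5]})
diff = (df['distance'] - df['Distance']).abs()
print(diff.max())  # small maximum difference suggests consistent sources
```

Large differences here would point to the two tools calculating the field with different algorithms.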
result=str_cth_join[(str_cth_join['device_watts']==True)]
result=result.dropna()
print("Shape of the dataframe after removing rides and na values is:", result.shape)
Rides with no measured power are removed: device_watts = True implies the power was measured by a power meter, whereas device_watts = False implies it was estimated, so we keep only the rows with measured power and also drop rows containing NaN values.
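As a toy illustration of this filter (hypothetical miniature frame, not the real ride data), note that dropna() removes rows by default, not columns:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame illustrating the filter above.
df = pd.DataFrame({'device_watts': [True, True, False],
                   'power': [210.0, np.nan, 180.0]})
# Keep measured-power rows, then drop rows containing NaN.
kept = df[df['device_watts']].dropna()
print(len(kept))  # only the first row survives both filters
```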
pd.options.mode.chained_assignment = None
result['elevation_gain'] = pd.to_numeric(result['elevation_gain'].str.split().str[0])
Converting elevation_gain into numeric form so that data can be plotted for this variable.
sn.distplot(result['distance'])
plt.show()
As per the above distribution plot, the variable distance is bi-modal: a bi-modal distribution has two peaks.
sn.distplot(result['moving_time'])
plt.show()
sn.distplot(result['Average Speed'])
plt.show()
As per the above distribution plot, the variable Average Speed is left skewed. A left-skewed distribution occurs when the mean is less than the median and the tail is tilted towards the left.
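The mean-versus-median rule for skewness can be checked numerically; a small sketch with synthetic data (not the ride data):

```python
import numpy as np

rng = np.random.default_rng(0)
# An exponential sample is right skewed: its mean exceeds its median.
x = rng.exponential(scale=10.0, size=10_000)
print(np.mean(x) > np.median(x))
# Mirroring it gives a left-skewed sample, where the mean is below the median.
y = -x
print(np.mean(y) < np.median(y))
```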
sn.distplot(result['Average Power'])
plt.show()
As per the above distribution plot, the variable Average Power is approximately normally distributed.
sn.distplot(result['TSS'])
plt.show()
corr_result =result[["distance", "moving_time","Average Speed","Average Heart Rate","Average Power","NP","TSS","elevation_gain"]]
sn.pairplot(corr_result,diag_kind = 'kde',plot_kws = {'alpha': 0.4, 's': 40, 'edgecolor': 'k'})
The pairplot clearly shows the pairwise relationships between the required variables: distance, moving_time, Average Speed, Average Power, Average Heart Rate, NP (Normalised Power), TSS and elevation_gain.
corr_result=result[['distance','Time Moving','Average Speed','NP','TSS','elevation_gain','Average Heart Rate','Average Power']]
corr_mtx_result=corr_result.corr()
corr_mtx_result
corr_mtx_result is the correlation matrix between the variables mentioned in the previous cell; negative values are highlighted in red. The higher the value in the correlation matrix, the stronger the relationship between the two respective variables.
Distance with Time Moving, Distance with TSS, distance with Elevation Gain.
Distance with Average Speed, Distance with Normalized Power, Distance with Average Heart Rate and Distance with Average Power.
Time Moving and Average Power, Average Speed and Time Moving, Average Speed and Elevation Gain etc.
wride=result[(result['workout_type']=='Ride')]
print("Shape of dataframe with workout type only Rides is ", wride.shape)
wrace=result[(result['workout_type']=='Race')]
print("Shape of dataframe with workout type only Race is ", wrace.shape)
wworkout=result[(result['workout_type']=='Workout')]
print("Shape of dataframe with workout type only workout is ", wworkout.shape)
The above piece of code divides the result dataframe into three subsets: wride, which has only ride-type workouts; wrace, which has only race records; and wworkout, which has only workout-type data.
plt.scatter(wride["distance"],wride["Elevation Gain"],color='green', label="Ride")
plt.scatter(wrace["distance"],wrace["Elevation Gain"],color='red', label="Race")
plt.scatter(wworkout["distance"],wworkout["Elevation Gain"],color='blue',label="Workout")
plt.xlabel("Distance")
plt.ylabel("Elevation Gain")
plt.legend()
plt.show()
The above scatter plot shows Distance against Elevation Gain for Ride, Race and Workout. Distance is how far the rider travelled from the start, and elevation gain is how many metres were climbed during the ride. We draw the following conclusions from the scatter plot:
plt.scatter(wride["elapsed_time"],wride["TSS"],color='black', label="Ride")
plt.scatter(wrace["elapsed_time"],wrace["TSS"],color='orange', label="Race")
plt.scatter(wworkout["elapsed_time"],wworkout["TSS"],color='yellow',label="Workout")
plt.xlabel("Elapsed Time")
plt.ylabel("Training Stress Score")
plt.legend()
plt.show()
The above scatter plot shows Elapsed Time against TSS for Ride, Race and Workout. Elapsed time is the time taken for the ride, and TSS stands for Training Stress Score, which determines how hard the ride was. We come to the conclusions below:
plt.scatter(wride["TSS"],wride["Calories (HR)"],color='blue', label="Ride")
plt.scatter(wrace["TSS"],wrace["Calories (HR)"],color='red', label="Race")
plt.scatter(wworkout["TSS"],wworkout["Calories (HR)"],color='yellow',label="Workout")
plt.xlabel("TSS")
plt.ylabel("Calories (HR)")
plt.legend()
plt.show()
The above scatter plot shows TSS against Calories (HR) for Ride, Race and Workout. Calories (HR) is the calorie expenditure as estimated from heart rate data, and TSS stands for Training Stress Score, which determines how hard the ride was. We come to the conclusions below:
distdf=pd.concat([wride['Distance'],wrace['Distance'],wworkout['Distance']], axis=1, keys=['Ride', 'Race','WorkOut'])
distdf.boxplot()
plt.ylabel("Distance")
plt.show()
From the above boxplots, we can draw the following conclusions for Distance:
distdf=pd.concat([wride['Calories (HR)'],wrace['Calories (HR)'],wworkout['Calories (HR)']], axis=1, keys=['Ride', 'Race','WorkOut'])
distdf.boxplot()
plt.ylabel("Calories (HR)")
plt.show()
From the above boxplots, we can draw the following conclusions for Calories (HR):
distdf=pd.concat([wride['TSS'],wrace['TSS'],wworkout['TSS']], axis=1, keys=['Ride', 'Race','WorkOut'])
distdf.boxplot()
plt.ylabel("TSS")
plt.show()
From the above boxplots, we can draw the following conclusions for TSS:
CHALLENGE ANALYSIS
kudos_rel=[result[result.columns[1:]].corr()['kudos'][:]] # result is the dataframe keeping only rides whose measured power is True
kudos_rel_df = pd.DataFrame(kudos_rel)
kudos_rel_df=kudos_rel_df.loc[:,kudos_rel_df.gt(0.7).any()]
kudos_rel_df
From the above code, we can conclude that the below factors lead to more Kudos/Likes:
distance (from the Strava file), Distance (from GoldenCheetah), Work, Aerobic TISS, P6 Time in Pace Zone, Distance Swim, L7 Time in Zone, TRIMP Zonal Points, W' Work and W2 W'bal Moderate Fatigue, each with a correlation above 0.7 with kudos.
fig,axs=plt.subplots(3,1,figsize=(8,20))
#Kudos relationship with distance
axs[0].scatter(result['kudos'], result['Distance'], color='red')
axs[0].set_xlabel('kudos' ,size =20)
axs[0].set_ylabel('Distance',size =20)
axs[0].set_title('Distance with Kudos',size =20)
#Kudos relationship with Elevation Gain
axs[1].scatter(result['kudos'], result['Elevation Gain'], color='black')
axs[1].set_xlabel('kudos',size =20)
axs[1].set_ylabel('Elevation Gain',size =20)
axs[1].set_title('Elevation Gain with Kudos',size =20)
#Kudos relationship with NP
axs[2].scatter(result['kudos'], result['NP'], color='green')
axs[2].set_xlabel('kudos',size =20)
axs[2].set_ylabel('NP',size =20)
axs[2].set_title('NP with Kudos',size =20)
plt.show()
The above scatter plots depict the relationship between kudos and the main variables distance, Elevation Gain and NP.
We can summarise the scatter plots as below:
1) The first scatter plot, between kudos and Distance, illustrates that as the distance increases there is an increase in kudos as well. This suggests that distance and kudos are proportional to each other.
2) The second scatter plot, between kudos and Elevation Gain, shows a mixed trend between the two variables: Elevation Gain increases while kudos remains constant, and vice versa.
3) The last scatter plot, between kudos and NP, shows that the relationship between these two starts from a high value rather than from 0.
month_name=[]
for e in result.index:
    month_name.append(calendar.month_name[e.month])
result.insert(2, "Month", month_name, True)
month_data_df=pd.DataFrame(result.groupby(['Month']).distance.sum())
month_data_df['TSS']=result.groupby(['Month']).TSS.sum()
month_data_df['Average Speed']=result.groupby(['Month'])['Average Speed'].mean()
month_data_df
The above code gets the distance travelled, the TSS and the average speed month-wise. It clearly shows that the most distance is travelled in March, which also has the highest TSS score, whereas September has the highest average speed recorded.
There are two bar graphs at the end which depict this analysis in graph form for better understanding.
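One caveat when grouping by month name: pandas sorts the group labels alphabetically, so the bars are not in calendar order. A hedged sketch of reordering (toy numbers, not the actual ride totals):

```python
import calendar
import pandas as pd

# Toy monthly totals (hypothetical numbers). groupby on month names sorts
# them alphabetically, so reindexing by calendar order restores the timeline.
totals = pd.Series({'March': 500.0, 'January': 300.0, 'September': 120.0})
order = [m for m in calendar.month_name[1:] if m in totals.index]
ordered = totals.reindex(order)
print(list(ordered.index))  # months now in calendar order
```

Applying the same reindex to month_data_df before plotting would make the bar charts read left to right in time.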
month_data_df.plot( y=["distance", "Average Speed", "TSS"], kind="bar")
plt.rcParams["figure.figsize"] = (8,10)
plt.legend(loc='best', prop={'size': 8})
plt.show()
month_data_df.plot( y=["Average Speed"], kind="bar", color="c")
plt.rcParams["figure.figsize"] = (9,11)
plt.legend(loc='best', prop={'size':10})
plt.show()
We can conclude that March has the highest distance travelled, and the TSS score is also at its maximum in that month, whereas September has the lowest distance travelled.
Average Speed is highest during September.
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
from pandas import DataFrame
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import calendar
from matplotlib import pyplot as plt
from datetime import timedelta
from datetime import time
from datetime import date
from datetime import datetime
plt.style.use('seaborn')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
The first dataset is the complete data, which consists of all the energy consumption data. Then we have two more datasets, training and testing, which are subsets of the complete data. We will train a model on one set of data and test it on the other to check the correctness of the model developed.
We are changing the data type of the date column to datetime for the analysis.
complete_data = pd.read_csv('data/energydata_complete.csv',parse_dates=True)
complete_data["date"]=pd.to_datetime(complete_data["date"])
complete_data.head()
date : year-month-day hour:minute:second
Appliances : energy use in Wh
lights : energy use of light fixtures in the house in Wh
T1 : Temperature in kitchen area in Celsius
RH_1 : Humidity in kitchen area in %
T2 : Temperature in living room area in Celsius
RH_2 : Humidity in living room area in %
T3 : Temperature in laundry room area
RH_3 : Humidity in laundry room area in %
T4 : Temperature in office room in Celsius
RH_4 : Humidity in office room in %
T5 : Temperature in bathroom in Celsius
RH_5 : Humidity in bathroom in %
T6 : Temperature outside the building (north side) in Celsius
RH_6 : Humidity outside the building (north side) in %
T7 : Temperature in ironing room in Celsius
RH_7 : Humidity in ironing room in %
T8 : Temperature in teenager room 2 in Celsius
RH_8 : Humidity in teenager room 2 in %
T9 : Temperature in parents room in Celsius
RH_9 : Humidity in parents room in %
To : Temperature outside (from Chièvres weather station) in Celsius
Pressure : from Chièvres weather station, in mm Hg
RH_out : Humidity outside (from Chièvres weather station) in %
Windspeed : from Chièvres weather station, in m/s
Visibility : from Chièvres weather station, in km
Tdewpoint : from Chièvres weather station, in °C
rv1 : Random variable 1, nondimensional
rv2 : Random variable 2, nondimensional
complete_data.info()
We are checking the data types of all the variables used in the dataset: the date column is datetime type, Appliances and lights are integer type, and all other variables are float type.
sns.heatmap(complete_data.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')
complete_data.isnull().sum()
We are checking any null values in the dataset and we find that there is no null value in the dataset.
complete_data.describe()
We are checking summary statistics of the dataset such as the count, mean (average), standard deviation and more.
Energy data
date_appliances_data=complete_data[["date","Appliances"]]
date_appliances_data.plot(kind='line',x='date', y='Appliances',color='brown')
plt.xlabel("Time" ,size =20)
plt.ylabel("Appliances Wh" , size =20)
plt.subplots_adjust(right=3)
#1st week data
date_appliances_data[1:1008].plot(kind='line',x='date', y='Appliances',color='grey')
plt.xlabel("Time 1 week",size =20)
plt.ylabel("Appliances Wh",size =20)
plt.subplots_adjust(right=3)
Two line graphs are plotted: the first depicts the whole data of Appliances against date, whereas the second illustrates one week of appliance usage. The highest amount of energy was consumed between the 15th and 16th of January.
date_appliances_data['Appliances'].plot.hist(bins=40,grid=True,color='green')
plt.subplots_adjust(right=2)
plt.xlim([0,1200])
plt.xlabel("Appliances Wh")
plt.ylim(0,10000)
plt.ylabel("Frequency")
The histogram depicts appliance usage, with frequency on the y axis. We can see that the distribution is right skewed, because the tail is tilted towards the right.
ax = sns.boxplot(x=date_appliances_data["Appliances"])
plt.xlabel("Appliances Wh")
plt.subplots_adjust(right=3)
plt.show()
As the above boxplot depicts, there are many outliers in the data, meaning plenty of values lie well outside the typical range.
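A boxplot flags points beyond 1.5 × IQR from the quartiles; the same rule can count outliers numerically. A sketch on toy data standing in for the Appliances column:

```python
import numpy as np

# Toy sample with one extreme value standing in for the Appliances column.
x = np.array([50, 60, 60, 70, 80, 90, 100, 580])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
# Points outside the 1.5*IQR fences are what the boxplot draws as dots.
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)  # only the extreme value is flagged
```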
train_data = pd.read_csv('data/energydata_training.csv' ,index_col='date')
print ("Shape of the training dataset is : ",train_data.shape)
train_data.head()
We are reading the training dataset and it has shape 14803 * 31 (i.e., 14803 rows and 31 columns).
train_data.describe()
test_data = pd.read_csv('data/energydata_testing.csv' ,index_col='date')
print ("Shape of the testing dataset is : ",test_data.shape)
test_data.head()
We are reading the testing dataset and printing its shape.
test_data.describe()
train_data_corr =train_data [["Appliances", "lights","T1","RH_1","T2","RH_2","T3","RH_3"]]
sns.pairplot(train_data_corr)
Plotting the first pairplot using the training data between Appliances, lights, T1, RH_1, T2, RH_2, T3 and RH_3. From this pairplot, we can make the conclusions below:
train_data_corr=train_data_corr.corr()
train_data_corr
The above piece of code shows the correlation matrix between the variables. It is similar to the pairplot; the only difference is that the pairplot visualises the correlations as graphs whereas the correlation matrix presents them as numbers.
complete_data['day_name'] = complete_data['date'].dt.day_name()
complete_data['hour'] = complete_data['date'].dt.hour
We are creating two new variables, assigning the weekday name to the day_name column and the hour to the hour column.
Firstmon = complete_data.loc[(complete_data.date >= '2016-01-01') & (complete_data.date <= '2016-01-28')]
Secondmon = complete_data.loc[(complete_data.date >= '2016-02-01') & (complete_data.date <= '2016-02-28')]
Thirdmon = complete_data.loc[(complete_data.date >= '2016-03-01') & (complete_data.date <= '2016-03-28')]
Fourthmon = complete_data.loc[(complete_data.date >= '2016-04-01') & (complete_data.date <= '2016-04-28')]
Firstmonv2 = pd.pivot_table(Firstmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')
Secondmonv2 = pd.pivot_table(Secondmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')
Thirdmonv2 = pd.pivot_table(Thirdmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')
Fourthmonv2 = pd.pivot_table(Fourthmon[['day_name','hour','Appliances']],index=['day_name','hour'],aggfunc='sum')
Firstmonv3 =Firstmonv2.unstack(level=0)
Secondmonv3 =Secondmonv2.unstack(level=0)
Thirdmonv3 =Thirdmonv2.unstack(level=0)
Fourthmonv3 =Fourthmonv2.unstack(level=0)
Firstmonv3 = Firstmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)
Secondmonv3 = Secondmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)
Thirdmonv3 = Thirdmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)
Fourthmonv3 = Fourthmonv3.reindex(labels=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],axis=1,level=1)
day_short_names = ['Sun','Mon','Tues','Wed','Thurs','Fri','Sat']
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Firstmonv3,cmap="YlGnBu",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Secondmonv3,cmap="BuPu",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Thirdmonv3,cmap="Blues",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
f,ax =plt.subplots(figsize=(5,15))
ax=sns.heatmap(Fourthmonv3,cmap="Greens",linewidths=.5, xticklabels=day_short_names)
ax.set(xlabel = 'Day of the Week',ylabel='Hour of Day')
train_data_corr2 =train_data [["Appliances", "T4", "RH_4","T5", "RH_5", "T6", "RH_6"]]
sns.pairplot(train_data_corr2,diag_kind = 'kde',plot_kws = {'alpha': 0.4, 's': 80, 'edgecolor': 'y'})
Plotting the second pairplot by using the training data between the columns Appliances,T4,RH_4,T5,RH_5,T6,RH_6
We can draw the below conclusions from this plot:
train_data_corr2.corr()
As before, this correlation matrix presents numerically what the pairplot above shows graphically.
train_data_corr3 =train_data [["Appliances", "T7", "RH_7","T8", "RH_8", "T9","RH_9"]]
sns.pairplot(train_data_corr3,diag_kind = 'kde',plot_kws = {'alpha': 0.2, 's': 50, 'edgecolor': 'k'})
The pairplot shows the correlation between the variables in a more digestible form. T7 has low correlation with RH_7, RH_8 and RH_9, while Appliances also has low correlation with RH_7, RH_8 and RH_9.
train_data_corr3.corr()
As before, this correlation matrix presents numerically what the pairplot above shows graphically.
train_data_corr4 =train_data [["Appliances", "T_out", "Press_mm_hg", "RH_out", "Windspeed","Visibility", "Tdewpoint", "NSM","T6"]]
sns.pairplot(train_data_corr4,diag_kind = 'kde',plot_kws = {'alpha': 0.2, 's': 100, 'edgecolor': 'b'})
The pairplot shows the correlation between the variables. We can conclude that Appliances and NSM are highly correlated, whereas Appliances has its weakest correlation with Visibility.
train_data_corr4.corr()
train_data_heatmap =plt.subplots(figsize=(10,10))
ax=sns.heatmap(train_data_corr4.corr(),cmap="Blues",linewidths=1.5,annot=True)
sns.distplot(train_data['Appliances'])
Plotting a distplot for Appliances to draw conclusions regarding the spread of the data. From this distplot, we can conclude that the data is skewed to the right.
Let's begin training of data...
We first need to split the data into two arrays, X and y, in which X has the variables on which we train and y has the target (whose value will be predicted from the trained variables).
We will remove two columns, WeekStatus and Day_of_week, because both store categorical values that the linear regression model cannot use.
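An alternative to listing the numeric columns by hand is to drop the non-numeric ones programmatically; a sketch on a hypothetical mini-frame (column names borrowed from the dataset):

```python
import pandas as pd

# Hypothetical mini-frame with one numeric feature, one categorical column
# and the target; the notebook would apply this to train_data instead.
df = pd.DataFrame({'T1': [19.9, 20.1],
                   'WeekStatus': ['Weekday', 'Weekend'],
                   'Appliances': [60, 230]})
# Drop the target, then keep only numeric columns for the regression.
X = df.drop(columns=['Appliances']).select_dtypes(include='number')
print(list(X.columns))  # the categorical column is gone
```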
train_data.columns
X_train = train_data[['lights',
'T1', 'RH_1', 'T2', 'RH_2', 'T3',
'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
'Visibility', 'Tdewpoint', 'rv1', 'rv2','NSM']]
print (X_train.shape)
y_train = train_data['Appliances']
print (y_train.shape)
The X array has only 28 columns because the 2 categorical variables have been dropped. We print the shapes of the X and y arrays.
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train,y_train)
print(linear_model.intercept_)
print(linear_model.coef_)
coeff_df = pd.DataFrame(linear_model.coef_, X_train.columns,columns=['Coefficient'])
coeff_df
The above piece of code prints the coefficients of all the variables; T3 has the highest model coefficient.
predictions_df = linear_model.predict(X_train)
comparison_df= pd.DataFrame({"Appliances Actual Value": y_train, "Appliances Predicted Value": predictions_df})
comparison_df.head()
X_test = test_data[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
'Visibility', 'Tdewpoint', 'rv1', 'rv2','NSM']]
print (X_test.shape)
y_test = test_data['Appliances']
print (y_test.shape)
As with the training data, we split the test data into X and y arrays, again removing the categorical WeekStatus and Day_of_week columns that the linear regression model cannot use.
predictions = linear_model.predict(X_test)
mse = ((y_test - linear_model.predict(X_test))**2).mean()
print("RMSE:", np.sqrt(mse))
RMSE: RMSE is a quadratic scoring rule that measures the average magnitude of the error. It is the square root of the average of the squared differences between prediction and actual observation.
print("MSE:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
MSE: The Mean Squared Error (MSE) of an estimator measures the average of the squared errors, i.e. the average squared difference between the estimated values and the true values.
R-squared (R2): a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model.
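These definitions can be checked with a tiny worked example, computed by hand from toy values rather than from the model:

```python
import numpy as np

# Toy actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
errors = y_true - y_pred
mse = np.mean(errors ** 2)                       # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # 1 - 5/8
print(round(mse, 4), round(rmse, 4), round(r2, 4))
```

The same numbers come out of sklearn's mean_squared_error and r2_score, since they implement these formulas.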
comparison_test_df= pd.DataFrame({"Actual Appliances' values": y_test, "Predicted Appliances' values": predictions})
comparison_test_df.head()
scatterplot =plt.subplots(figsize=(10,10))
ax=plt.scatter(abs(y_test-predictions),test_data['Appliances'])
plt.xlim([-100,1200])
plt.xlabel("Appliances Energy Consumption", size = 10)
plt.ylim(-300,1400)
plt.ylabel("Residuals", size = 10)
From the above plot, we can conclude that the variables used in the linear model are not good enough for the estimate, because the residuals are not distributed around the horizontal line.
linear_model = LinearRegression()
rfe = RFE(linear_model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X_train,y_train)
linear_model.fit(X_rfe,y_train)
print(rfe.support_)
print(rfe.ranking_)
From the above analysis, we can say that when the Appliances value is predicted using all the independent columns in the data, we get a high MSE.
RFE (i.e., Recursive Feature Elimination) method to make predictions based on selected variables only
nof_list=np.arange(1,28)
high_score=0
nof=0
score_list =[]
for n in range(len(nof_list)):
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
cols = list(X_train.columns)
model = LinearRegression()
rfe = RFE(model, n_features_to_select=25)
X_rfe = rfe.fit_transform(X_train,y_train)
model.fit(X_rfe,y_train)
temp = pd.Series(rfe.support_,index = cols)
selected_features_rfe = temp[temp==True].index
print(selected_features_rfe)
X_train = train_data[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5',
'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out',
'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint']]
print (X_train.shape)
linear_model_new = LinearRegression()
linear_model_new.fit(X_train,y_train)
X_test = test_data[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5',
'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out',
'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint']]
print (X_test.shape)
predictions = linear_model_new.predict(X_test)
mse = ((y_test - linear_model_new.predict(X_test))**2).mean()
print("RMSE:", np.sqrt(mse))
print("MSE:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
compare_df_rfe= pd.DataFrame({"Actual Appliances' values": y_test, "Predicted Appliances' values": predictions})
compare_df_rfe.head()
scatterplot =plt.subplots(figsize=(10,10))
x=plt.scatter(abs(y_test-predictions),test_data['Appliances'])
plt.xlim([-100,1200])
plt.xlabel("Appliances Energy Consumption", size = 10)
plt.ylim(-300,1400)
plt.ylabel("Residuals", size = 10)
We can see that the Mean Squared Error still has quite a high value even after using the Recursive Feature Elimination method.
We can say that a good model depends upon the variables used in it. In this portfolio, the model could be made better by using a more significant set of variables for the prediction.
K-means clustering is one of the simplest and popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes. This notebook illustrates the process of K-means clustering by generating some random clusters of data and then showing the iterations of the algorithm as random cluster means are updated.
We first generate random data around 4 centers.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
%matplotlib inline
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])
#Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4
data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)
plt.scatter(data[:,0], data[:,1], s=7, c='k')
plt.show()
print("Shape of the dataset is :", data.shape)
data
The above piece of code shows the shape of the dataset for which the different centroids will be calculated.
The data is 2D; we also print the data itself.
You need to generate four random centres.
This part of portfolio should contain at least:
k is set to 4; centres = np.random.randn(k,c)*std + mean, where std and mean are the standard deviation and mean of the data and c represents the number of features in the data. Set the random seed to 6. The centre markers are coloured green, blue, yellow, and cyan, with edgecolors set to red.
k=4
mean=data.mean()
std=np.std(data)
print ("The mean for the normalised data is:",mean)
print ("The standard deviation for the normalised data is:",std)
np.random.seed(6)
centres = np.random.randn(k,2)*std + mean #centres variable stores value of the calculated centroids
print(centres)
A random series has been generated and the number of clusters is set to 4 (k=4). Centres are generated as per the formula given in the description above (np.random.randn(k,c)*std + mean, where std is the standard deviation, a measure used to depict the amount of variation of data values from the average value, and mean is the average of the values).
def gen_random_centres():
    plt.scatter(data[:,0], data[:,1], s=6, c='k')
    color_coding=['g','b','y','c']
    for i in range(4):
        color=color_coding[i]
        plt.scatter(centres[i][0],centres[i][1],marker='*', s=400, c=color,edgecolor='r')
gen_random_centres()
The function gen_random_centres plots the centroids on top of the data with the defined colour coding of green, blue, yellow and cyan, and an edge colour of red.
The for loop runs 4 times because we have 4 centroids to plot.
Kmeans Algorithm
Basic steps involved in the Kmeans algorithm:
clusters=np.zeros(len(data))
clusters.shape
def euc_dist(point, centroid, ax=1):
    return np.linalg.norm(point - centroid, axis=ax)
def cluster_search():
    for i in range(len(data)):
        distances = euc_dist(data[i],centres)
        cluster = np.argmin(distances)
        clusters[i] = cluster
A function cluster_search has been defined to find the nearest centroid for each data point and assign the point to that centroid's cluster.
def centres_search():
    for d in range(k):
        pts=[data[j] for j in range(len(data)) if clusters[j]==d]
        centres[d]=np.mean(pts,axis=0)
A function named centres_search has been defined to find the new centres using the averages of the points which belong to the same cluster.
cluster_search()
print(clusters)
centres_search()
print(centres)
gen_random_centres()
Printing the new centres in the form of an array.
cluster_search()
centres_search()
print(centres)
gen_random_centres()
Above piece of code shows 2nd iteration of updating centroids. gen_random_centres() prints the new centres.
cluster_search()
centres_search()
print(centres)
gen_random_centres()
Above piece of code shows 3rd iteration of updating centroids. gen_random_centres() prints the new centres.
cluster_search()
centres_search()
print(centres)
gen_random_centres()
Above piece of code shows 4th iteration of updating centroids. gen_random_centres() prints the new centres.
cluster_search()
centres_search()
print(centres)
gen_random_centres()
Above piece of code shows 5th iteration of updating centroids. gen_random_centres() prints the new centres.
We can see that in the last iteration there is no significant change in the centroids; here we can halt the process and say that the aforementioned plot is the final one.
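The manual iterations above can be automated by repeating the assign/update steps until the centres stop moving. A sketch on synthetic two-blob data, using np.allclose as the stopping test (a hypothetical convergence check, not part of the original notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of points.
data = np.concatenate([rng.normal(loc, 0.5, size=(50, 2))
                       for loc in ([0.0, 0.0], [5.0, 5.0])])
k = 2
# Deterministic start for the sketch: one point from each blob.
centres = data[[0, 50]].copy()
for _ in range(100):
    # Assignment step: nearest centre for every point.
    d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each centre moves to the mean of its assigned points.
    new_centres = np.array([data[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centres, centres):
        break  # centres stopped moving: converged
    centres = new_centres
print(np.sort(centres[:, 0]))  # centres end near the two blob means
```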